Analysis of the first run of the benchmarking experiment

Author

György Barabás

1 Loading and tidying the data

We first set up some functions to load and tidy the raw data:

library(tidyverse) # data wrangling and plotting
library(broom)     # tidy model summaries
library(ggfortify) # autoplot() diagnostics for lm fits
library(jsonlite)  # reading the raw JSON files
library(knitr)     # kable() tables
library(mblm)      # median-based (Theil-Sen) regression



# Read the server configuration and return a tibble with one row per
# (server, platform) pair, giving the download host ("ip") for each.
serversFromConfig <- function(configFile = "../config.json") {
  jsonlite::fromJSON(configFile) |>
    as_tibble() |>
    select(contains("dl")) |> # keep only the "<platform>_dl_servers" entries
    mutate(server = str_c("Server ", 1:3), .before = 1) |> # label the three servers
    rename_with(\(x) str_remove(x, "_dl_servers"), !server) |>
    pivot_longer(!server, names_to = "storage", values_to = "ip") |>
    mutate(storage = case_match( # prettify platform names
      storage,
      "swarm" ~ "Swarm",
      "ipfs"  ~ "IPFS",
      "arw"   ~ "Arweave"
    ))
}
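
For reference, here is a hypothetical sketch of the shape this function assumes for ../config.json: one "<platform>_dl_servers" array per platform, listing the download hosts in server order. The Swarm hosts below are the ones that appear in the data; the IPFS and Arweave hosts (and the exampleConfig name) are placeholders, not the actual configuration:

exampleConfig <- list(
  swarm_dl_servers = c("download.gateway.ethswarm.org",
                       "188.245.154.61:1633",
                       "188.245.177.151:1633"),
  ipfs_dl_servers  = c("ipfs-host-1", "ipfs-host-2", "ipfs-host-3"),
  arw_dl_servers   = c("arweave-host-1", "arweave-host-2", "arweave-host-3")
)
jsonlite::toJSON(exampleConfig, pretty = TRUE)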


# Load the raw benchmark results, unnesting the JSON hierarchy so that each
# row corresponds to a single download.
dataFromJsonRaw <- function(jsonFile = "../results.json") {
  jsonlite::fromJSON(jsonFile) |>
    as_tibble() |>
    unnest(tests) |>
    unnest(results)
}


# Tidy the raw data: coerce types, harmonize platform names, and attach the
# server labels from the configuration file.
dataFromJson <- function(jsonFile = "../results.json") {
  dataFromJsonRaw(jsonFile) |>
    mutate(sha256_match = (sha256_match == "true")) |> # string -> logical
    mutate(storage = ifelse(storage == "Ipfs", "IPFS", storage)) |>
    rename(time_sec = download_time_seconds) |>
    mutate(size_kb = as.integer(size)) |>
    select(!size & !server & !timestamp) |> # drop columns that get replaced
    left_join(serversFromConfig(), by = join_by(storage, ip)) |>
    relocate(size_kb, server, time_sec, attempts, sha256_match, .after = storage)
}

After loading and tidying the data, here’s what the first few rows of the table look like:

dat <- dataFromJson()

dat |>
  head(n = 10) |>
  kable()

storage  size_kb  server    time_sec  attempts  sha256_match  ip                             latitude  longitude
Swarm          1  Server 1  157.7660         1  TRUE          download.gateway.ethswarm.org   50.4779    12.3713
Swarm          1  Server 2  246.9573         1  TRUE          188.245.154.61:1633             49.4542    11.0775
Swarm          1  Server 3  157.5822         1  TRUE          188.245.177.151:1633            49.4542    11.0775
Swarm          1  Server 1  157.6129         1  TRUE          download.gateway.ethswarm.org   50.4779    12.3713
Swarm          1  Server 2  220.8360         1  TRUE          188.245.154.61:1633             49.4542    11.0775
Swarm          1  Server 3  157.0362         1  TRUE          188.245.177.151:1633            49.4542    11.0775
Swarm          1  Server 1  163.7713         1  TRUE          download.gateway.ethswarm.org   50.4779    12.3713
Swarm          1  Server 2  233.7285         1  TRUE          188.245.154.61:1633             49.4542    11.0775
Swarm          1  Server 3  157.0660         1  TRUE          188.245.177.151:1633            49.4542    11.0775
Swarm          1  Server 1  159.3709         1  TRUE          download.gateway.ethswarm.org   50.4779    12.3713

We can do some sanity checks. First of all, every download succeeded:

dat |>
  count(sha256_match) |>
  kable()

sha256_match     n
TRUE          1350

And the experiment is well balanced, with 30 replicates per size, server, and platform:

dat |>
  count(size_kb, server, storage) |>
  kable()

size_kb  server    storage     n
      1  Server 1  Arweave    30
      1  Server 1  IPFS       30
      1  Server 1  Swarm      30
      1  Server 2  Arweave    30
      1  Server 2  IPFS       30
      1  Server 2  Swarm      30
      1  Server 3  Arweave    30
      1  Server 3  IPFS       30
      1  Server 3  Swarm      30
     10  Server 1  Arweave    30
     10  Server 1  IPFS       30
     10  Server 1  Swarm      30
     10  Server 2  Arweave    30
     10  Server 2  IPFS       30
     10  Server 2  Swarm      30
     10  Server 3  Arweave    30
     10  Server 3  IPFS       30
     10  Server 3  Swarm      30
    100  Server 1  Arweave    30
    100  Server 1  IPFS       30
    100  Server 1  Swarm      30
    100  Server 2  Arweave    30
    100  Server 2  IPFS       30
    100  Server 2  Swarm      30
    100  Server 3  Arweave    30
    100  Server 3  IPFS       30
    100  Server 3  Swarm      30
   1000  Server 1  Arweave    30
   1000  Server 1  IPFS       30
   1000  Server 1  Swarm      30
   1000  Server 2  Arweave    30
   1000  Server 2  IPFS       30
   1000  Server 2  Swarm      30
   1000  Server 3  Arweave    30
   1000  Server 3  IPFS       30
   1000  Server 3  Swarm      30
  10000  Server 1  Arweave    30
  10000  Server 1  IPFS       30
  10000  Server 1  Swarm      30
  10000  Server 2  Arweave    30
  10000  Server 2  IPFS       30
  10000  Server 2  Swarm      30
  10000  Server 3  Arweave    30
  10000  Server 3  IPFS       30
  10000  Server 3  Swarm      30

Furthermore, most downloads succeeded in a single attempt; only three downloads, all on Arweave, required a second attempt:

dat |>
  count(storage, attempts) |>
  kable()

storage  attempts    n
Arweave         1  447
Arweave         2    3
IPFS            1  450
Swarm           1  450

2 Preliminary analysis

Plotting the raw results, we get:

dat |>
  select(storage | size_kb | server | time_sec) |>
  mutate(storage = fct_reorder(storage, time_sec)) |>
  mutate(size = case_when(
    size_kb ==     1 ~ "1 KB",
    size_kb ==    10 ~ "10 KB",
    size_kb ==   100 ~ "100 KB",
    size_kb ==  1000 ~ "1 MB",
    size_kb == 10000 ~ "10 MB"
  )) |>
  mutate(size = fct_reorder(size, size_kb)) |>
  ggplot(aes(x = time_sec, color = storage, fill = storage)) +
  geom_density(alpha = 0.2, bw = 0.05) +
  scale_x_log10(breaks = c(10, 60, 360), labels = c("10s", "1m", "6m")) +
  labs(x = "Retrieval time", y = "Density",
       color = "Platform: ", fill = "Platform: ") +
  scale_color_manual(values = c("steelblue", "goldenrod", "forestgreen")) +
  scale_fill_manual(values = c("steelblue", "goldenrod", "forestgreen")) +
  facet_grid(server ~ size, scales = "fixed") +
  theme_bw() +
  theme(legend.position = "bottom", panel.grid = element_blank())

Here we have retrieval times (on a log scale) along the x-axis and the estimated density of observations along the y-axis; the curves are higher where the data are more concentrated. Colors represent the storage platforms, facet rows the servers, and facet columns the file sizes.

At a glance, we see that IPFS is the fastest. For small files, Swarm is faster than Arweave; for larger files it is a bit slower, but still comparable.
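
As a numeric companion to the plot, one could tabulate median retrieval times per platform and file size, pooled across servers. This is a quick sketch, not part of the original output:

dat |>
  group_by(storage, size_kb) |>
  summarise(median_sec = median(time_sec), .groups = "drop") |>
  pivot_wider(names_from = storage, values_from = median_sec) |>
  kable()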

What is strange is that there appears to be an “anti-pattern” for IPFS: larger files lead to shorter retrieval times. Let us look at this more closely, for all three platforms:

dat |>
  mutate(storage = fct_relevel(storage, "Swarm", "IPFS", "Arweave")) |>
  ggplot(aes(x = size_kb, y = time_sec)) +
  geom_point(color = "steelblue", alpha = 0.5) +
  geom_smooth(method = lm, color = "goldenrod", fill = "goldenrod") +
  scale_x_log10() +
  labs(x = "File size (KB)", y = "Download time (seconds)") +
  facet_grid(server ~ storage) +
  theme_bw()

We see that for both IPFS and Arweave, larger files lead to shorter download times. For Arweave and Server 1, this pattern appears reversed, but that is due to the outliers in the largest size category distorting the ordinary least-squares fit. Indeed, a median-based (Theil–Sen) regression detects a decreasing trend:

dat |>
  mutate(storage = fct_relevel(storage, "Swarm", "IPFS", "Arweave")) |>
  ggplot(aes(x = size_kb, y = time_sec)) +
  geom_point(color = "steelblue", alpha = 0.5) +
  geom_smooth(method = \(formula, data, weights) mblm(formula, data),
              color = "goldenrod", fill = "goldenrod") +
  scale_x_log10() +
  labs(x = "File size (KB)", y = "Download time (seconds)") +
  facet_grid(server ~ storage) +
  theme_bw()
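
To confirm that a handful of extreme values in the largest size category drive the positive OLS slope for Arweave on Server 1, one could inspect the slowest downloads in that cell. A sketch (output not reproduced here):

dat |>
  filter(storage == "Arweave", server == "Server 1") |>
  slice_max(time_sec, n = 5) |> # the five slowest downloads in this cell
  select(size_kb, time_sec, attempts) |>
  kable()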

An overall increasing trend is seen only for Swarm, and even there the relationship between file size and download time is clearly nonlinear: download times initially stagnate, or even decrease slightly, before taking off again.
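
One way to visualize this nonlinearity is to replace the straight-line smoother with a loess curve for Swarm alone. A sketch, not part of the original analysis:

dat |>
  filter(storage == "Swarm") |>
  ggplot(aes(x = size_kb, y = time_sec)) +
  geom_point(color = "steelblue", alpha = 0.5) +
  geom_smooth(method = loess, color = "goldenrod", fill = "goldenrod") +
  scale_x_log10() +
  labs(x = "File size (KB)", y = "Download time (seconds)") +
  facet_wrap(~ server) +
  theme_bw()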

Otherwise, all fitted slopes are highly unlikely to be due to chance alone, as the p-values below show:

regressionDat <- dat |>
  mutate(size = log10(size_kb)) |>
  nest(data = !storage & !server) |>
  mutate(fit = map(data, \(dat) lm(time_sec ~ size, data = dat))) |>
  mutate(regtab = map(fit, broom::tidy)) |>
  unnest(regtab)

regressionDat |>
  select(!data & !fit) |>
  filter(term != "(Intercept)") |>
  kable()

storage  server    term   estimate  std.error   statistic  p.value
Swarm    Server 1  size   52.29077  5.1876652   10.079827        0
Swarm    Server 2  size   53.74020  3.3197602   16.187977        0
Swarm    Server 3  size   55.93188  3.6941106   15.140826        0
IPFS     Server 1  size  -12.92478  0.2866705  -45.085838        0
IPFS     Server 2  size  -10.67000  0.3773290  -28.277707        0
IPFS     Server 3  size  -13.47224  0.3235440  -41.639594        0
Arweave  Server 1  size   40.63760  5.5333705    7.344096        0
Arweave  Server 2  size  -15.68051  0.2631899  -59.578667        0
Arweave  Server 3  size  -15.67807  0.2632833  -59.548311        0

However, the assumptions behind linear regression do not hold well for Swarm, or for Arweave on Server 1:

# For each fit, draw the four standard lm diagnostic panels, titled by
# platform and server. The suppressMessages()/capture.output()/invisible()
# chain at the end merely silences the printed tibble, so only the plots show.
regressionDat |>
  filter(term != "(Intercept)") |>
  mutate(diagnostics = map(fit, \(x) {
    autoplot(x, smooth.colour = NA, alpha = 0.3, colour = "steelblue") +
      theme_bw()
  } )) |>
  mutate(diagnostics = pmap(list(diagnostics, storage, server), \(dia, sto, se) {
    gridExtra::grid.arrange(grobs = dia@plots, top = str_c(sto, ", ", se))
  } )) |>
  suppressMessages() |>
  capture.output() |>
  invisible()
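
A more formal check could complement these plots: a Breusch–Pagan test for heteroskedasticity on each fit. This is a sketch; it assumes the lmtest package is available, which is not among the libraries loaded above:

regressionDat |>
  filter(term != "(Intercept)") |>
  mutate(bp_p = map_dbl(fit, \(x) lmtest::bptest(x)$p.value)) |> # small p: heteroskedastic
  select(storage, server, bp_p) |>
  kable()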

For this reason, let us regenerate the regression tables, but using Theil–Sen regression instead. The results are comparable, except that the slope for (Arweave, Server 1) has the opposite sign:

dat |>
  mutate(size = log10(size_kb)) |>
  nest(data = !storage & !server) |>
  mutate(fit = map(data, \(dat) mblm(time_sec ~ size, dataframe = dat))) |>
  mutate(regtab = map(fit, broom::tidy)) |>
  unnest(regtab) |>
  select(!data & !fit) |>
  filter(term != "(Intercept)") |>
  mutate(p.value = round(p.value, 5)) |>
  kable()

storage  server    term   estimate  std.error  statistic  p.value
Swarm    Server 1  size   12.52489  10.005017      10887  0.00000
Swarm    Server 2  size   38.18759  22.489381      11308  0.00000
Swarm    Server 3  size   25.53666  11.496877      11295  0.00000
IPFS     Server 1  size  -13.15176   1.562340          0  0.00000
IPFS     Server 2  size  -11.46449   2.503900          0  0.00000
IPFS     Server 3  size  -13.46260   1.729444          0  0.00000
Arweave  Server 1  size  -13.09084   2.669667       3822  0.00056
Arweave  Server 2  size  -15.65500   1.436399          0  0.00000
Arweave  Server 3  size  -15.65599   1.428417          0  0.00000
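
To make that sign flip explicit, one could place the OLS and Theil–Sen slopes side by side. A sketch, where theilSenDat is a hypothetical name for the result of re-running the Theil–Sen pipeline above with assignment:

theilSenDat <- dat |>
  mutate(size = log10(size_kb)) |>
  nest(data = !storage & !server) |>
  mutate(fit = map(data, \(dat) mblm(time_sec ~ size, dataframe = dat))) |>
  mutate(regtab = map(fit, broom::tidy)) |>
  unnest(regtab)

# One row per (storage, server), with both slope estimates next to each other:
regressionDat |>
  filter(term != "(Intercept)") |>
  select(storage, server, ols_slope = estimate) |>
  left_join(
    theilSenDat |>
      filter(term != "(Intercept)") |>
      select(storage, server, theil_sen_slope = estimate),
    by = join_by(storage, server)
  ) |>
  kable()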